Web Scraping 
Objectives


• Parse HTML and CSS elements in webpages

• Use requests and BeautifulSoup to get and process webpage contents

• Scrape websites ethically


HyperText Markup Language (HTML) 


If the HTTP response can only contain strings, how does the 
browser know how to display a website? 


Markup! 


• Everything the browser needs to know is embedded into one big string

• The string is structured with a hierarchy of tags that represent each component to be rendered


[Figure: an HTML file, its stylesheet, and the page as rendered by the web browser]

HTML:
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="styles.css" />
</head>
<body>
<h1>This is a header</h1>
<p>Some text for my paragraph</p>
</body>
</html>

styles.css:
h1 {
font-size: 20pt;
color: red;
}

Rendered in the web browser:
This is a header
Some text for my paragraph





https://artvaark-design.ie/how-does-wordpress-work/ 
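To make the "hierarchy of tags" concrete, here is a minimal sketch that walks the page from the figure and prints its tags as an indented outline. It uses only `html.parser` from the Python standard library; the class name `OutlineParser` and the outline format are made up for this illustration.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="styles.css" />
</head>
<body>
<h1>This is a header</h1>
<p>Some text for my paragraph</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Collect an indented outline of the tags and text in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1  # anything until the end tag is nested inside

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <link ... /> have no children.
        self.outline.append("  " * self.depth + tag)

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip the whitespace between tags
            self.outline.append("  " * self.depth + repr(text))


parser = OutlineParser()
parser.feed(PAGE)
print("\n".join(parser.outline))
```

The deeper the indentation, the deeper the tag sits in the hierarchy: `h1` and `p` are children of `body`, which is a child of `html`.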


Why Scrape?


[Screenshot: Trade Data Online report, report date 2020-02-05. Title: Canadian total exports. Industries: NAICS 11111 (soybean farming). Origin: Canada. Destination: all countries (total). Period: latest 5 years. Units: value in millions of Canadian dollars. The report can be saved as CSV or as Excel.]

                        2014    2015    2016    2017    2018
All Countries (Total)   1,987   2,359   2,542   2,499   2,889


[Screenshot: rental listing "Small Bungalow w/Private Entrance", entire guesthouse, 1 bed, $75 per night]


[Screenshot: two product listings, followed by the price and name scraped from each]

31.86 BarkBox Small Rainfurrest Dog Toy & Treat Bund...

9.99 Dog Chew Rope Toys Knotted Clean Teeth Cotton ...


Why Not Scrape? 


• Downloadable dataset (for example, a CSV file)
• Database connection
• Public REST API
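When a site already offers its data as a download, reading the file is usually far less work than scraping the page. The sketch below parses a small CSV with the standard library. The column layout is an assumption (the real download may be organized differently); the figures are the ones from the Trade Data Online report, in millions of Canadian dollars.

```python
import csv
import io

# Assumed layout of a "Save report as CSV" download (hypothetical).
REPORT_CSV = """Destination,2014,2015,2016,2017,2018
All Countries (Total),"1,987","2,359","2,542","2,499","2,889"
"""


def read_report(text):
    """Return {destination: {year: value}} from the CSV text."""
    report = {}
    for row in csv.DictReader(io.StringIO(text)):
        name = row.pop("Destination")
        # Remaining columns are years; strip the thousands separators.
        report[name] = {
            int(year): int(value.replace(",", "")) for year, value in row.items()
        }
    return report


report = read_report(REPORT_CSV)
print(report["All Countries (Total)"][2018])
```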


Ethical Considerations 


• Terms of Service
• Denial of Service Attacks
• Confidentiality
This article discusses legal issues related to web scraping 


We are not lawyers - this does not constitute legal advice. 
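One practical way to avoid looking like a denial of service attack is to pause between requests. The following is a minimal sketch, not a complete politeness policy: `fetch_politely` and `fake_fetch` are hypothetical names, the one-second default is an arbitrary choice, and the stand-in fetch function is there so the example runs without touching a real site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, sleeping between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # give the server a break
        pages[url] = fetch(url)
    return pages


def fake_fetch(url):
    # Stand-in for a real download function (assumption for the demo).
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

Passing the fetch function in as an argument keeps the throttling logic separate from whichever library does the downloading.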


Python Tools for Web Scraping 


Name          | Fetch HTML | Parse HTML | Notes
urllib        | X          |            | Python standard library
requests      | X          |            | Lightweight, fast, easy
BeautifulSoup |            | X          | Fast, can parse any XML file
scrapy        | X          | X          | Complex, powerful
selenium      | X          | X          | Automates a real web browser
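As a rough illustration of the "Fetch HTML" column, here is a sketch that downloads a page with `urllib` from the standard library. The function name `fetch_html`, the User-Agent string, and the 10-second timeout are assumptions for this example. With requests, the equivalent call would be along the lines of `requests.get(url, timeout=10).text`.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded to a string."""
    # Identify the scraper; the User-Agent value here is a placeholder.
    request = Request(url, headers={"User-Agent": "web-scraping-demo"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access, so it is left commented out):
# html = fetch_html("https://example.com")
```

The returned string is the "one big string" of markup, ready to hand to a parser.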


Demo in Postman HTTP Client 
To download this yourself, go to: 
https://www.getpostman.com/downloads/ 


I recommend that everyone download this tool or an
alternative with similar features


CSS Selector Basics 


What are CSS Selectors? 


• Used by web developers to select groups of elements for styling
• CSS selectors are always strings


Rules for determining CSS selectors 


• Start with HTML element type
• . (period) means “with the following class”
• # (hash symbol) means “with the following id”
• Spaces indicate hierarchy
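To show how these rules are read, here is a toy matcher built on the standard library. It is a teaching sketch only: it handles element type, `.class`, `#id`, and spaces, needs well-formed markup, and is not how real selector engines work. The functions `parse_step`, `matches`, and `select` and the sample document are all made up for this illustration; for real scraping, a library such as BeautifulSoup does this for you.

```python
import re
import xml.etree.ElementTree as ET


def parse_step(step):
    """Split one step such as 'div#main.wide' into (tag, id, classes)."""
    tag = re.match(r"[A-Za-z][\w-]*", step)   # element type comes first
    ids = re.findall(r"#([\w-]+)", step)       # '#' introduces an id
    classes = re.findall(r"\.([\w-]+)", step)  # '.' introduces a class
    return (tag.group(0) if tag else None, ids[0] if ids else None, set(classes))


def matches(element, step):
    """Check one element against one selector step."""
    tag, id_, classes = parse_step(step)
    if tag and element.tag != tag:
        return False
    if id_ and element.get("id") != id_:
        return False
    return classes <= set(element.get("class", "").split())


def select(root, selector):
    """Return the elements matched by a space-separated selector."""
    steps = selector.split()
    matched = [el for el in root.iter() if matches(el, steps[0])]
    for step in steps[1:]:
        # A space means: look among the descendants of the previous matches.
        found = []
        for parent in matched:
            for el in parent.iter():
                if el is not parent and matches(el, step) and el not in found:
                    found.append(el)
        matched = found
    return matched


# Hypothetical document used only for this demonstration.
DOC = ET.fromstring(
    "<body>"
    '<div id="main" class="card wide"><p class="note">inside</p></div>'
    '<p class="note">outside</p>'
    "</body>"
)

print([el.text for el in select(DOC, "p.note")])
print([el.text for el in select(DOC, "div.card p")])
```

The first selector matches both paragraphs; the second keeps only the paragraph nested inside the `div`, because the space restricts the search to descendants.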


CSS Selector Example 


Element: 
<div id="masterSearchResults">...</div> 


CSS Selector: 
"div#masterSearchResults"
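A short sketch of how this selector might be used with BeautifulSoup (assuming the `bs4` package is installed). Only the `<div id="masterSearchResults">` element comes from the example; its contents and the second `<div>` are invented so there is something to select and something to skip.

```python
from bs4 import BeautifulSoup

# Hypothetical page built around the element from the example.
HTML = """
<div id="header">Site header</div>
<div id="masterSearchResults"><p>Result 1</p><p>Result 2</p></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select_one returns the first matching element, or None if nothing matches.
results = soup.select_one("div#masterSearchResults")
texts = [p.get_text() for p in results.find_all("p")]
print(results["id"])
print(texts)
```

Because an id should be unique within a page, `select_one` is a natural fit here; `select` would return the same element wrapped in a list.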




CSS Selector Example 


Element: 
<ul id="authorList"><li>Stephen King</li></ul> 


CSS Selector: 
"ul#authorList li"
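The space in this selector means "any `li` inside `ul#authorList`". A minimal sketch with BeautifulSoup (assuming `bs4` is installed): the second list item and the second list are placeholders added for the demonstration, not part of the original example.

```python
from bs4 import BeautifulSoup

# The first <li> is from the example; everything else is made up.
HTML = """
<ul id="authorList"><li>Stephen King</li><li>Another Author</li></ul>
<ul id="otherList"><li>Not an author</li></ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select returns every match, in document order.
authors = [li.get_text() for li in soup.select("ul#authorList li")]
print(authors)
```

The `li` in the other list is left out because it is not a descendant of the element with id `authorList`.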




